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Abstract 

We  present  a  block-based,  automatic  partitioning  ^lnd  scheduling  method¬ 
ology  for  sparse  matrix  factorization  on  distributed  memory  systems.  Using 
experimental  results,  we  analyze  this  technique  for  coiiununication  and  load 
imbalance  overhead.  To  study  the  performance  effects,  we  compEire  these  over¬ 
heads  with  those  obtained  from  a  strjiightforward  “wrap-mapped”  column  as¬ 
signment  scheme.  All  experimental  results  were  obtained  using  test  spmse 
matrices  from  the  Harwell- Boeing  data  set.  The  results  show  that  there  is  a 
communication  and  load  balance  trade-off.  The  block-based  method  results  in 
lower  commimication  cost  whereas  the  wrap-mapped  scheme  gives  better  load 
balance. 
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under  NASA  contract  NASl-18605  while  the  first  author  was  in  residence  at  ICASE,  Mail  Stop 
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1  Introduction 


!*^artitioning  and  scheduling  the  parallel  execution  of  large  scientific  applications  on 
distributed  memory  systems  is  a  difficult  and  time  consuming  task.  If  the  dependen¬ 
cies  involved  are  unstructured,  as  in  the  case  of  sparse  linear  systems,  then  the  task 
becomes  even  more  complex.  Use  of  naive  techniques  to  extract  parallelism  often 
results  in  large  communication  overhead  and/or  in  large  load  imbalance.  To  reduce 
communication  overhead,  locality  of  data  must  be  exploited  and  to  balance  the  load, 
the  computations  must  be  evenly  distributed  at  all  times.  When  the  data  depen¬ 
dencies  are  non-uniform  and  unstructured,  achieving  these  two  goals  simiiltaneously 
is  difficult.  As  a  result,  in  such  cases,  the  overall  performance  may  turn  out  to  be 
poor,  even  if  an  application  has  a  high  degree  of  extractable  parallelism.  One  possible 
way  to  minimize  the  overhead  is  to  make  use  of  the  structure  of  the  sparse  system 
which  can  usually  be  determined  prior  to  performing  the  numerical  computations. 
When  direct  methods  are  used  to  solve  the  sparse  systems,  this  information  in  the 
form  of  the  structure  of  the  factored  matrix  is  routinely  used  to  reduce  computation 
and/or  storage  costs.  Recently,  this  information  has  also  been  applied  in  extracting 
parallelism  while  maintaining  low  communication  and  load  imbalance  costs  [5],  [6], 
[14].  However,  in  most  cases,  parallelism  has  been  extracted  manually,  wliich  tends 
to  be  extremely  tedious,  error  prone,  and  inflexible.  Thus,  automation  is  the  key  to 
successful  parallelization  of  such  applications.  To  summarize,  there  are  two  important 
issues  in  the  efficient  parallelization  of  sparse  matrix  based  computations: 


•  Developing  technology  for  the  automatic  parallelization  of  the  computations. 

•  Developing  a  methodology  for  the  extraction  of  the  available  parallelism  with 
minimum  communication  and  load  imbalance  costs. 

To  address  these  issues,  we  have  developed  an  automatic,  block-based  scheme  for  par¬ 
titioning  and  scheduling  the  computations  in  factoring  a  sparse  matrix.  The  scheme 
makes  use  of  the  structure  of  the  factor  and  is  targeted  towards  distributed  memory 


systems.  To  reduce  communication,  it  takes  advantage  of  locality.  However,  to  main-  _ 

tain  proper  load  balance  and  a  high  degree  of  parallelism,  the  scheme  makes  use  of  "’f’ 
an  adaptive  technique  in  distributing  the  computational  work.  .'.'  *.1 

To  demonstrate  the  usefulness  of  such  a  partitioning  scheme  and  to  bring  out  the  ced 
performance  limitations  that  are  inherent  in  sparse  matrix  computations,  we  compare 


the  communication  overhead  and  the  degree  of  load  balance  in  the  automated  block- 


based  approach  with  that  obtained  from  a  straightforward  and  widely  us<xl  column- 
based  approach.  In  the  latter  scheme,  computations  associated  with  an  entire  column 


or  row  are  assigned  to  a  processor  and  the  assignment  of  these  columns  or  rows  is 
usually  done  in  a  “wrap-around”  fashion.  We  rehu  to  this  scheme  as  the  wrap-mapping  . 


liity  Codes 
-i  and/or 
pocled 


or  wrap  scheme.  For  comparing  the  performance  on  practical  applications,  we  present 
results  for  some  of  the  Harwell-Boeing  test  matrices. 

In  the  following  discussion,  it  is  assumed  that  the  reader  is  familiar  with  the  standard 
terminology  used  in  the  context  of  sparse  matrix  computations.  For  an  explanation, 
see  [7], [3]. 

The  organization  of  the  rest  of  the  paper  is  as  follows.  In  the  next  section,  the 
Cholesky  factorization  is  briefly  described  and  some  of  the  terminology  used  in  the 
paper  is  introduced.  The  partitioning  and  scheduling  strategies  that  are  used  for 
automation  are  presented  in  Section  3.  Performance  results  are  described  in  Section  4 
and  Section  5  concludes  the  paper. 


2  Cholesky  factorization 


The  partitioning  and  scheduling  methodology  is  described  in  this  paper  assuming 
Cholesky  factorization  as  the  model  numerical  method  of  computation.  The  Cholesky 
algorithm  is  simple,  well  understood,  and  is  widely  used.  Note,  however,  that  the 
techniques  presented  here  are  applicable  to  other  factoring  methods  as  well.  In  the 
following,  we  highlight  only  those  aspects  of  this  algorithm  that  are  essential  for 
describing  the  partitioner.  For  details  on  the  Cholesky  factorization  scheme,  see  [9]. 

For  the  sake  of  completeness,  we  first  briefly  describe  the  four  steps  involved  in  the 
direct  solution  of  Ax  —  b.  (For  details  see,  for  example,  [8].)  It  is  assumed  that  A 
is  symmetric,  positive  definite  and  that  Cholesky  factorization  is  used  in  computing 
the  factor  L,  where  A  =  LL^ . 

1.  Ordering:  Find  a  good  ordering  of  the  unknowns  for  elimination.  The  ordering 
is  given  by  a  permutation  matrix  P.  Most  often,  a  “good”  ordering  implies 
one  which  would  lead  to  a  sparse  factor  and  fewer  arithmetic  operations  in  the 
numerical  factorization  step. 

2.  Symbolic  Factorization:  Determine  the  sparsity  structure  of  the  factor  L. 

3.  Numerical  Factorization:  Compute  L. 

4.  Triangular  Solutions:  Using  the  computed  L,  solve  the  triangular  systems  Lu  = 
Pb,  =  u  and  set  x  =  P^v. 


The  basic  element-level  data  dependencies  in  the  factorization  process  are  shown  in 
Figure  1. 
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Figure  1:  Inter-element  dependencies  in  Cholesky  factorization 

In  that  figure,  only  the  lower  triangular  part  of  the  matrix  to  be  factored  is  shown. 

denotes  the  element  in  row  i  and  column  j.  The  direction  of  the  arrows  indicates 
the  data  fiow.  Thus,  elements  Lj^k  and  from  column  k  of  the  factor  L  are  required 
in  computing  element  Lij.  Lij  =  Lij  —  Li^k  *  Lj^k  is  the  corresponding  operation  in 
the  Cholesky  factorization.  (Initially  Lij  is  set  to  Aij.)  We  refer  to  this  operation  as 
a  single  update  operation.  Note  that  in  computing  the  final  value  of  it  must  be 
updated  by  all  pairs  of  non-zero  elements  Ljk  ^nd  L,  jj;,  1  <  k  <  j.  Finally,  after  all 
the  updates  are  performed,  the  element  is  scaled  by  the  square  root  of  the  diagonal 
element  in  that  column. 


3  Partitioning  and  scheduling 


The  partitioning  scheme  presented  here  is  static  in  the  sense  that  all  tl  o  computations 
are  partitioned  before  any  of  the  computations  are  scheduled  for  execution.  For  this, 
the  partitioner  takes  as  an  input  the  structure  of  the  factor  for  the  sparse  matrix. 
However,  the  scheme  is  general  and  does  not  have  knowledge  of  any  matrix  structure 
embedded  in  it. 

As  stated  in  the  introductory  section,  the  aim  of  the  partitioner  and  the  scheduler 
is  to  reduce  communication  and  at  the  same  time  maintain  a  balanced  work  load 
among  processors  at  all  times.  To  achieve  this,  wherever  possible,  data  locality  is 
exploited.  This  leads  to  some  variation  of  block-based  partitioning;  such  partitioning 
approaches  have  been  proposed  in  several  linear  algebra  related  problems  [2],  [12]. 
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With  blocking,  it  is  possible  to  achieve  a  high  ratio  of  computation  to  communication 
per  block.  In  [11],  it  is  shown  that  lor  an  important  class  of  problems,  the  block-based 
partitioning  schemes  result  in  an  optimal  utilization  of  the  data  accessed  (and  thus 
reduce  data  traffic).  Blocking,  however,  could  lead  to  load  imbalance  because  the 
increase  in  the  size  of  schedulable  units  results  in  a  loss  of  flexibility  in  distributing 
work  among  processors.  To  avoid  this,  the  partitioner  described  here  partitions  the 
factored  matrix  into  blocks  of  varying  sizes  that  can  be  assigned  in  an  equitable 
manner  to  the  processors.  It  makes  use  of  a  heuristic  where  the  block  sizes  are 
subject  to  adaptive  manipulation.  In  the  following  we  describe  the  functioning  of  the 
partitioner  in  some  detail. 

The  partitioning  starts  with  the  zero-nonzero  structure  of  the  filled  sparse  matrix 
obtained  after  the  symbolic  factorization  phase  has  been  completed.  Blocks  of  non¬ 
zero  areas  are  identified  in  the  filled  matrix.  Wc  refer  to  these  as  dense  blocks.  On 
occasions,  blocks  are  formed  by  including  small  regions  that  correspond  to  zeros  in 
the  factored  matrix  in  order  to  obtain  larger  blocks.  Inclusion  of  such  areas  with  zero 
elements  is  kept  to  a  minimum.  The  work  in  these  dense  blocks  is  partitioned  into 
sub- blocks  which  are  the  basic  schedulable  units.  These  unit  blocks  have  a  regular 
shape  -  each  unit  block  is  either  a  column,  a  rectangle  or  a  triairgle.  After  all  the  unit 
blocks  are  identified,  the  dependencies  between  these  blocks  are  determined.  Finally 
the  unit  blocks  arc  assigned  and  scheduled  on  processors. 

Thus,  the  steps  involved  in  the  automatic  partitioning  and  scheduling  are: 

•  Identify  dense  blocks  in  the  symbolic  factor. 

•  Partition  each  dense  block  into  schedulable  unit  blocks. 

•  Generate  and  store  dependency  information  for  the  unit  blocks. 

•  Schedule  these  units  on  the  processors  of  a  message  passing  system. 

•  Consolidate  the  non-locaJ  memory  access  information  for  each  processor  so  as 
to  minimize  communication  overhead. 


In  the  remainder  of  this  section,  we  will  describe  the  first  four  steps. 


3.1  Identification  of  dense  blocks 

To  identify  the  dense  blocks,  first  clusters  of  columns  are  determined  in  the  sparse 
triangular  factor.  A  cluster  is  a  either  a  column  or  a  strip  of  consecutive  columns. 
If  it  is  a  strip,  it  contains  a  dense  triangular  block  at  the  top  and  (possibly)  a  .set  of 
off-diagonal  dense  rectangular  blocks.  This  is  illustrated  using  an  example  shown  in 
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Figure  2.  In  that  figure,  non-zero  elements  in  the  filled  41  x  41  matrix  are  indicated 
by  the  dark  areas.  The  matrix  corresponds  to  a  5-point  finite  element  5x5  grid  and  is 
ordered  using  Liu’s  multiple  minimum  degree  algorithm  [10].  It  was  generated  using 
the  Sparse  Matrix  Manipulation  System  developed  at  the  University  of  Wisconsin  [1]. 


Figure  2:  A  41  x  41  filled  matrix. 

In  Figure  2,  note  the  following  in  the  lower  triangular  part.  Cluster  1  spans  columns 
1  and  2  and  cluster  2  spans  columns  3  and  4.  Both  clusters  1  and  2  have  a  three- 
element  dense  triangular  block  at  the  diagonal.  Cluster  1  has  three  dense  rectangles 
below  the  triangle,  each  of  which  is  1  x  2,  while  cluster  2  has  two  dense  rectangles,  the 
upper  one  being  1x2  and  the  lower  one  being  2x2.  Clusters  3  through  12  are  single 
columns  starting  with  cluster  3  at  column  5.  The  last  cluster  consists  of  columns  35 
through  41.  This  cluster  has  one  dense  triangle  and  no  rectangles  below  it.  Note  that 
in  this  illustration  we  do  not  consider  column  34  as  part  of  the  last  cluster  because 
of  the  zero  in  row  38  of  this  column.  But  this  can  be  over-ridden  by  allowing  some 
zeros  to  be  a  part  of  a  triangle. 

Once  the  clusters  and  the  triangular  and  rectangular  blocks  within  each  cluster  are 
identified,  the  algorithm  processes  the  clusters  left  to  right  in  the  matrix.  When  a 
cluster  is  processed,  each  block  in  the  cluster  is  partitioned  into  sub-blocks  which  are 
schedulable  units.  Next,  for  each  unit,  the  dependencies  are  determined  and  stored. 
These  steps  are  explained  below. 
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3.2  Partitioning  of  a  block 


A  cluster  with  a  single  column  is  considered  to  be  a  schedulable  unit  and  is  not 
subject  to  further  partitioning.  In  a  multi-column  cluster,  the  triangular  block  is 
partitioned  first.  In  general,  the  number  of  partitions  of  a  triangle  are  determined  by 
(a)  the  number  of  processors  that  are  assigned  to  the  blocks  on  which  the  triangle 
depends,  (b)  a  certain  minimum  work  requirement  per  unit  sub-block.  The  first  pa¬ 
rameter  restricts  communication  to  the  group  of  processors  that  work  on  the  triangle 
and  its  predecessors.  The  second  parameter  is  used  to  ensure  a  satisfactory  ratio  of 
computation  to  communication  for  each  unit  block  and  is  an  architecture  dependent 
parameter.  This  parameter  may  be  used  to  vary  block  sizes  from  one  cluster  to  the 
next.  For  the  results  presented  here  we  use  a  fixed  size  -  one  for  all  the  triangular 
block  and  another  for  the  rectangular  blocks.  This  is  referred  to  as  the  grain  size  and 
is  the  minimum  number  of  matrix  elements  required  in  each  unit  block.  The  grain 
size  dictates  a  maximum  number  of  partitions,  say  Pj.  A  block  is  partitioned  into  at 
most  Pj  equal  sized  units;  at  most  because  it  may  not  always  be  possible  to  break  up 
a  block  into  exactly  Pd  equal  sized  units. 


Figure  3;  Partitions 

Figtirc  3  illustrates  this  partitioning.  The  triangle  is  partitioned  into  six  parts.  One 
of  tin;  rectangles  is  partitioned  into  four  parts  and  the  other  is  partitioned  into  three 
parts. 
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3.3  Identification  of  dependencies 


The  dependencies  in  a  single  update  operation  at  the  element  level  of  Cholesky  fac¬ 
torization  are  shown  in  Figure  1.  However,  for  allocation  and  scheduling  of  the  units, 
it  is  necessary  to  identify  the  dependencies  at  the  block  level.  In  this  step,  for  each 
unit  block,  the  dependencies  are  determined  and  the  information  on  the  actual  data 
needed  in  the  update  operation  is  stored.  This  step  also  identifies  columns  or  block 
units  that  are  independent,  i.e.,  those  that  are  not  updated  by  any  other  units.  To 
automate  this  process,  it  is  necessary  to  classify  the  dependencies  at  the  inter-block 
level.  We  have  classified  these  dependencies  into  ten  categories  which  are  enumerated 
next.  Using  this  classification  and  the  interval  tree  structure,  the  partitioner  computes 
the  dependencies  efficiently.  The  implementation  details  are  given  elsewhere. 

In  the  following  discussion,  a  column  is  represented  by  its  column  number  in  the 
matrix,  a  rectangle  is  represented  by  its  column  extent  (c^,  Cj),  Ci  <  cj  and  row  extent 
’’p  ^  f'q,  and  a  triangle  is  represented  by  its  row  extent  (or  column  extent, 
which  is  the  same  as  the  row  extent)  (r,-,rj),  <  rj. 

1.  A  column  updates  a  column 

This  forms  the  base  case  for  the  dependencies.  A  column  k  updates  a  column 
j  if  Lj^k  is  non-zero,  (see  Figure  1). 

2.  A  column  updates  a  triangle 

Let  triangle  T's  row  extent  be  (ri,r2).  A  column  k,  k  <  ri,  updates  the  triangle 
if  is  non- zero,  rj  <  f  <  rj.  In  Figure  4(a),  the  non- zero  elements  of  column 
k  that  are  involved  in  the  update  are  in  rows  ii,  12  and  13.  The  points  of 
intersection  of  the  dotted  lines  with  each  other  and  of  the  dotted  lines  with  the 
diagonal  are  the  points  of  triangle  T  that  are  updated  by  column  k. 

3.  A  column  updates  a  rectangle 

Let  rectangle  iE’s  column  extent  be  (ci,  C2)  and  row  extent  be  (r-i,  Pj)-  A  column 
k  updates  this  rectangle  if  it  has  pmrs  of  non-zero  elements  L{  k  and  Lj^k,  where 
Cl  <  i  <  C2  and  ri  <  j  <  r2.  In  Figure  4(b),  the  non-zero  elements  in  rows  ii 
and  12  combine  with  the  non-zero  elements  in  rows  ji,  72  and  js  to  update  a 
portion  of  R.  This  updated  portion  is  the  set  of  points  given  by  the  intersection 
of  the  dotted  lines  in  R's  interior. 

4.  \  triangle  updates  a  rectangle 

Let  the  column  extent  of  rectangle  R  be  (cj,  C2)  and  the  column  extent  of  triangle 
T  be  (03,04).  The  triangle  updates  the  rectangle  if  there  is  an  intersection  in 
their  column  extents.  In  Figure  4(c),  the  shaded  portion  of  T  updates  the 
shaded  portion  of  R. 
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5.  A  triangle  and  a  rectangle  update  a  rectangle 


Let  rectangle  Ri  have  column  extent  (ci,C2)  and  row  extent  {rx^r^)  and  let 
rectangle  i?2  have  column  extent  (c3,C4)  and  row  extent  (r3,r4).  Let  C2  <  C3. 
Let  the  column  extent  of  triangle  T  be  (csjCe).  T  combines  with  Ri  to  update 
i?2  if  (ci,C2)  intersects  (csjCe),  (03,04)  intersects  (05,06)  and  (ri,r2)  intersects 
{rz^Tx)-  In  Figure  4(d),  the  shaded  rectangular  portion  of  T  combines  with  the 
entire  shaded  rectangle  Ri  to  update  the  entire  shaded  rectangle  Rz- 

6.  A  rectangle  updates  a  column 

Let  the  row  extent  of  rectangle  R  be  (j“i,r2).  It  updates  a  column  k  iirx  <  k  < 
rz-  In  Figure  4(e),  the  shaded  portion  of  the  rectangle  between  rows  k  and  T2 
update  the  column  elements  between  rows  k  and  rz- 

7.  Two  rectangles  update  a  column 

Let  rectangle  Ri  have  column  extent  (01,02)  and  row  extent  (ri,?^)  and  let 
rectangle  R2  have  column  extent  (03,04)  and  row  extent  (t-3,7'4).  Let  rz  <  rz. 
Then  Ri  combines  with  Rz  to  update  a  column  k  if  ri  <  k  <  rz  and  (01,02) 
intersects  (03,04).  In  Figure  4(f),  the  elements  of  Ri  which  are  in  the  row  k 
between  the  vertical  dotted  lines  combine  with  the  entire  shaded  rectangle  Rz 
to  update  the  elements  between  rows  T3  and  r4  in  column  k. 

8.  A  rectangle  updates  a  triangle 

Let  the  row  extent  of  rectangle  Ri  be  (ri,r2)  and  the  row  extent  of  triangle  T 
be  (r3,r4).  The  rectangle  updates  the  triangle  if  (ri,r2)  intersects  (r3,r4).  In 
Figure  4(g),  the  shaded  portion  of  R  updates  the  shaded  portion  of  T. 

9.  Two  rectangles  update  a  triangle 

Let  rectangle  Ri  have  column  extent  (01,02)  and  row  extent  (ri,r2)  and  let 
rectangle  Rz  have  column  extent  (03,04)  and  row  extent  (r’3,^4)-  Let  rz  <  rz- 
Let  the  row  extent  of  triangle  T  he  (^*5,06).  Then  Ri  combines  with  Rz  to 
update  T  if  (01,02)  intersects  (03,04)  and  {ti^tz)  intersects  (r5,7-6)  and  (7-3,  r4) 
intersects  {r5,re)-  In  Figure  4(h),  the  shaded  portion  of  Ri  combines  with  the 
entire  shaded  rectangle  Rz  to  update  the  shaded  rectangular  portion  of  T. 

10.  Two  rectangles  update  a  rectangle 

Let  rectangle  Ri  have  column  extent  (oi,  02)  and  row  extent  (ri,  r2),  rectangle  Rz 
have  column  extent  (03, 04)  and  row  extent  {rz-,  r^)  and  rectangle  Rz  have  column 
extent  (05,05)  and  row  extent  (05, re).  Let  rz  <  rz,  rz  <  r^  and  04  <  05.  Then  Ri 
combines  with  Rz  to  update  Rz  if  (oi,  03)  intersects  (03, 04)  and  (r3,  r^)  intersects 
(7’5,t’6)  and  (ri,r2)  intersects  (05,05).  In  Figure  4(i),  the  shaded  portion  of  Ri 
combines  with  the  shaded  portion  of  Rz  to  update  the  shaded  part  of  Rz- 
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3.4  Scheduling 


The  scheduling  process  is  split  up  into  two  parts:  allocating  unit  blocks  to  processors 
and  ordering  the  computational  work  within  each  processor.  In  this  paper,  we  are 
concerned  with  the  first  part  only  and  the  salient  points  therein  are  presented  next. 

First  the  independent  columns,  as  identified  in  the  previous  step,  are  allocated  to 
processors  in  a  wrap-around  fashion.  The  remaining  clusters  are  scanned  again  from 
left  to  right.  If  a  cluster  is  a  dependent  column,  the  entire  column  is  allocated  to  a 
processor,  which  is  arbitrarily  picked  from  the  set  of  processors  which  worked  on  the 
column’s  predecessors.  If  the  cluster  is  not  a  column,  the  unit  blocks  in  the  triangular 
part  are  allocated  to  processors,  followed  by  the  unit  blocks  in  each  rectangular  block, 
going  top  to  bottom.  For  example,  in  the  cluster  shown  in  Figure  3,  the  six  sub-blocks 
of  the  triangle  would  be  allocated  first,  followed  by  the  four  sub-blocks  of  the  rectangle 
below  it,  finishing  up  with  the  three  sub-blocks  of  the  bottom-most  rectangle. 

Allocation  within  a  triangle  proceeds  by  first  allocating  the  triangular  units  from 
top  to  bottom,  followed  by  the  rectangular  units,  going  top  to  bottom  and  left  to 
right.  In  the  Figure  3  for  instance,  the  sub-blocks  in  the  triangle  would  be  allocated 
in  the  order  ti,  ta,  te,  t2,  <4,  U-  A  global  set  of  all  processors,  Pg,  is  maintained, 
with  a  marker  pointing  to  the  first  “available”  processor.  This  marker  cycles  through 
the  global  set  in  a  round-robin  fashion  and  is  moved  up  every  time  a  unit  block  is 
allocated  to  the  currently  available  processor.  Apart  from  this,  a  set  of  processors.  Pa) 
which  have  been  already  allocated  to  some  sub-block  in  the  triangle  is  maintained. 
Initially,  Pa  is  empty.  The  strategy  for  allocating  a  processor  to  a  unit  rectangle  or 
unit  triangle  is  the  same.  First,  the  predecessors  of  the  unit  block  are  scanned.  For 
each  predecessor,  if  the  processor  p  which  worked  on  it  is  not  in  Po,  the  unit  block 
is  allocated  to  p  and  p  is  added  to  Pa.  If  all  of  the  processors  which  worked  on  all 
the  predecessors  of  the  unit  block  are  already  in  Pa,  the  unit  block  is  alle''ated  to  the 
currently  available  processor  in  Pg  and  the  marker  is  moved  up  to  the  next  processor 
in  Pg. 

For  allocating  the  units  within  a  rectangle  below  the  triangle,  the  choice  of  processors 
is  restricted  to  Pf,  where  Pt  is  the  set  of  processors  to  which  the  unit  blocks  in  the 
triangle  are  allocated.  Since  there  is  a  large  amount  of  communication  between  a  tri¬ 
angle  and  the  rectangles  below  it,  this  strategy  helps  in  reducing  the  communication. 
First,  the  processors  in  set  Pt  are  ordered  according  to  increasing  work.  Going  in 
round- robin  fashion  through  Pt,  the  processors  are  assigned  to  the  unit  blocks  in  the 
rectangle,  going  top  to  bottom  and  left  to  right  within  the  rectangle.  For  example, 
let  processors  pi,  p2  and  pa  be  assigned  to  the  unit  blocks  on  the  triangle  in  Figure  3. 
Assume  that  the  ordering  according  to  work  is  such  that  pi  <  pa  <  pa-  Then,  in 
the  first  rectangle  below  the  triangle,  rn  is  allocated  to  pi,  ria  is  allocated  to  pa,  ria 
is  allocated  to  pa,  ri4  is  allocated  to  pi.  The  set  Pt  is  sorted  again  and  the  above 
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strategy  is  used  to  allocate  r2i,  7-22  and  r^z- 


4  Performance 


In  this  section  we  present  results  on  the  performance  of  the  above  described  partitioner 
and  scheduler,  in  terms  of  the  quality  of  partitioning  and  allocation  that  it  produces. 
To  quantify  the  results,  we  measure  the  communication  overhead  in  terms  of  the  total 
data  traffic  generated  and  the  load  balance  in  terms  of  a  fctctor  that  measures  the 
deviation  from  perfect  load  balance.  We  also  compare  the  results  with  those  using 
the  straightforward  column  wrap  assignment  scheme.  For  this  purpose,  we  have 
used  some  of  the  representative  test  matrices  from  the  Harwell-Boeing  package  [4]. 
These  test  matrices  were  partitioned  and  the  work  units  were  scheduled  as  described 
in  the  previous  section.  Using  this  output,  simidations  were  carried  out  to  get  the 
performance  rcsidts  presented  here. 


Application 

No.  of 

eqns. 

No.  of 

non-zeros 

No.  of 

non-zeros 

in  factor 

Description 

T^US1138 

1138 

2596 

3304 

Symmetric  structure  of  power 
system  networks 

CANN1072 

1072 

6758 

20512 

Symmetric  pattern  from 

Cannes,  Lucien  Marro 

DWT512 

512 

2007 

3786 

Symmetric  submarine  frame 
from  Naval  Ship  Research 
and  Development  Center 

LAP30 

900 

4322 

16697 

Symmetric  matrix  representing 
9-point  discretization  of  the 
Laplacian  on  the  unit  square  w/ 
Dirichlet  boundary  conditions 

LSHP1009 

1009 

3937 

18268 

Symmetric  matrix  from 

Alan  George’s  LSHAPE  probs. 

Table  1:  Selected  Harwell-Boeing  Test  Matrices 

For  all  the  results  presented  in  this  section,  the  test  matrices  were  ordered  using  Liu’s 
modified  multiple  minimum  degree  ordering  scheme  [10].  We  used  some  of  the  tools 
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from  SPARSKIT  [13]  and  the  Wisconsin  Sparse  Matrix  Manipulation  System  [1]  for 
converting  the  test  matrices  into  various  formats,  and  for  ordering  and  symbolically 
factoring  the  matrices.  Table  1  describes  the  Harwell- Boeing  test  matrices  which  were 
used  in  our  experiments. 

In  the  following,  we  first  quantify  the  communication  and  work  load  distribution 
aspects  of  the  partitioning  schemes.  Note  that  here  we  are  concerned  with  the  quality 
of  the  partitioner/scheduler  in  distributing  the  work  among  the  processors  and  hence 
do  not  take  into  account  data  dependency  delays.  In  practice,  the  total  execution  may 
be  affected  by  the  dependency  delays  as  well.  However,  if  the  number  of  processors 
is  relatively  sm?Jl  compared  to  the  number  of  schedulable  units,  then  the  allocation 
scheme  described  here  provides  enough  parallelism  to  keep  the  idle  time  to  a  minimum. 

The  communication  cost  is  parameterized  by  the  total  data  traffic  generated  in  the 
system  and  the  mean  data  traffic  per  processor.  The  data  traffic  is  defined  as  a  count 
of  all  the  non-local  data  accesses.  Accessing  a  single  non-local  element  constitutes  a 
unit  data  traffic  irrespective  of  the  location  Rom  where  it  is  fetched.  Once  a  data 
element  is  fetched,  that  element  is  stored  locally  and  subsequent  usage  of  that  element 
in  the  local  computations  does  not  add  to  the  data  traffic.  The  total  data  traffic  in 
the  system  is  the  sum  of  the  data  accesses  by  all  the  processors  in  the  system.  This 
figure  represents  the  volume  of  the  data  that  must  be  transmitted  by  the  system 
during  the  entire  factorization  step. 

The  work  load  distribution  of  a  partitioning  scheme  is  characterized  as  follows.  The 
computation  cost  of  updating  an  element  of  the  matrix  by  a  pair  of  off-diagonal 
elements  is  assumed  to  be  two  units;  updating  th  ^  element  by  the  diagonal  element  is 
assumed  to  cost  one  unit.  The  computational  work  assigned  to  a  processor  is  the  sum 
of  the  computation  costs  of  all  the  elements  assigned  to  that  processor.  The  quality 
of  the  work  load  distribution  for  a  partitioning  scheme  is  measured  in  terms  of  the 
load  imbalance  resulting  from  the  assignment  of  the  work  to  the  processors.  The  load 
imbalance  factor  is  defined  as, 

,  _  {Wma.  -  Wave)  *  N 
Wtot 

where  Wtot  is  the  total  work,  N  is  the  number  of  processors.  Wave  =  Wtot/N  is  the 
average  work  and  Wmax  is  the  maximum  work  assigned  to  any  processor.  Note  that 
when  the  load  is  perfectly  distributed,  Wmax  is  Wave  and  A  is  zero.  The  load  imbalance 
factor  can  be  related  to  the  efficiency  e,  which  is  the  ratio  of  speedup  to  number  of 
processors,  where  speedup  is  the  ratio  of  sequential  time  to  parallel  time.  In  the  case 
of  zero  idle  times  due  to  dependency  delays,  the  parallel  time  is  simply  the  amount 
of  computational  work  in  the  processor  with  the  maximum  work.  The  efficiency  can 
then  be  expressed  as, 

Weot  _  Wav, 

^  W  *  M  W 

max  rnax 
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which  gives  ns 


.  _  {Wmax  -  Wave)  *  N  _  W^ax  -  _  1  _  -i 

Vytot  "  VFa«e 

Table  2  gives  the  communication  traffic  in  the  block  scheme  for  two  cases  respectively; 
when  the  grain  size  is  4  and  when  the  grain  size  is  25. 


Appl. 

P 

Total 

Mean 

g=4 

g=25 

g=4 

g=25 

4 

1335 

1194 

334 

298 

BUS 

16 

1818 

1567 

114 

98 

32 

1910 

1649 

60 

103 

4 

47545 

40716 

11886 

10179 

CANN 

16 

138453 

80334 

8653 

5021 

32 

171965 

89042 

5374 

2783 

4 

5336 

3768 

1334 

942 

DWT 

16 

10328 

5482 

645 

342 

32 

11305 

5950 

353 

185 

4 

38424 

29382 

9606 

7346 

LAP 

16 

100012 

44738 

6251 

2796 

32 

113717 

48863 

3554 

1527 

4 

42044 

29899 

10511 

7475 

LSHP 

16 

106973 

57773 

6686 

3611 

32 

127612 
_ 1 

60243 

3988 

1883 

Table  2:  Block  mapping  communication. 

Recall  that  the  grain  size  is  the  minimum  number  of  elements  in  any  triangular  or 
rectangular  partition.  In  both  cases,  total  communication  increases  with  the  number 
of  processors  for  all  the  test  problems.  However,  when  the  grain  size  is  increased  from 
4  to  25,  there  is  a  significant  reduction  in  communication.  For  instance,  in  the  LAP30 
problem,  the  g  =  4  and  g  =  25  columns  for  total  communication  in  table  2  show  that 
there  is  more  than  50%  reduction  in  the  total  communication  for  p  =  16  and  p  =  32. 
This  is  due  to  the  fact  that  as  the  block  size  increases,  more  work  is  done  in  each 
block  with  a  lot  of  re-use  of  data. 

Table  3  describes  the  work  distribution  in  the  block  scheme  for  grain  sizes  4  and  25. 
In  contrast  to  the  reduction  in  communication  with  higher  grain  size,  in  most  cases, 
there  is  an  increase  in  load  imbalance.  Furthermore,  the  load  imbalance  factor  A 
increases,  in  general,  with  the  number  of  processors,  as  well. 
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Overall,  the  larger  the  grain  size,  the  smaller  is  the  communication,  at  the  cost  of 
larger  load  imbalance.  If  the  application  is  run  on  a  system  with  high  communication 
cost  as  compared  to  computation  cost,  the  block-based  partitioning  can  give  good 
performance  i.e.  the  savings  in  communication  will  be  more  than  offset  the  disad¬ 
vantage  of  load  imbalance.  Also,  the  load  balance  can  be  improved  by  using  more 
sophisticated  strategies  to  allocate  blocks  to  processors. 


Appl. 

Procs. 

Work  Distribution 

Mean 

A 

g=4 

4 

2791 

0.77 

0.8 

BUS 

16 

698 

3.59 

3.59 

32 

349 

6.3 

6.3 

4 

151460 

yoi 

0.122 

GANN 

16 

37865 

MM 

0.62 

32 

18932 

yj^ 

1.26 

4 

11701 

jSli 

0.18 

DWT 

16 

2925 

1.37 

32 

1462 

3.67 

4 

108644 

0.12 

0.16 

LAP 

16 

27161 

0.13 

1.13 

32 

13581 

0.48 

2.9 

4 

125392 

0.24 

LSHP 

16 

31348 

mi 

0.74 

32 

15674 

2.04 

Table  3:  Block  mapping  work  distribution. 

Apart  from  grain  size,  another  parameter  used  in  the  tests  was  the  minimum  cluster 
width.  For  instance,  if  the  minimum  cluster  width  is  4,  no  strip  of  columns  less  than 
four  columns  wide  is  acceptable  as  a  cluster  -  it  is  broken  up  into  individual  columns. 
The  larger  the  minimum  width  acceptable,  the  fewer  number  of  non-single-column 
clusters  there  are.  For  any  problem,  if  the  cluster  width  is  set  high  enough,  we  end 
up  with  all  single  columns.  The  results  of  table  2  and  table  3  were  obtained  using  a 
minimum  cluster  width  of  four. 

Table  4  shows  the  variation  of  communication  and  load  distribution  with  minimum 
cluster  width  for  LAP30.  The  table  shows  an  increase  in  communication  when  the 
width  goes  from  2  to  4  and  then  a  decrease  when  the  width  goes  to  8.  Load  imbalance 
shows  a  complementary  behavior.  It  decreases  when  the  width  goes  from  2  from  4 
and  then  increases  when  the  width  goes  from  4  to  8.  The  cluster  width  has  to  go  in 
step  with  the  grain  size.  If  the  cluster  width  is  too  small  compared  to  the  grain  size, 
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a  large  number  of  skinny  clusters  would  be  formed  towards  the  left  of  the  matrix. 
The  blocks  would  not  have  enough  matrix  elements  to  take  advantage  of  reduction  in 
communication  offered  by  the  large  grain  size. 


Width 

P 

Communication 

Work  Distr. 

Total 

Mean 

Mean 

A 

4 

38936 

9734 

108644 

0.03 

2 

16 

96235 

6015 

27161 

0.167 

32 

111519 

3485 

13580 

0.54 

4 

38424 

9606 

108644 

wm 

4 

16 

100012 

6251 

27161 

Ira 

32 

113717 

3554 

13580 

ira 

4 

32569 

8142 

108644 

0.62 

8 

16 

88408 

5526 

27161 

1.35 

32 

101725 

3179 

13580 

2.3 

Table  4:  Variation  with  width  for  LAP30,  g  =  4. 

Table  5  presents  the  results  for  the  wrap-mapping  case.  The  immediately  noticeable 
property  is  the  consistently  uniform  load  distribution,  as  seen  by  the  A  column.  How¬ 
ever,  a  smaller  grain  size  in  the  block  scheme  gives  a  two-fold  advantage  of  decrease 
in  communication  without  too  much  load  imbalance  as  compared  to  wrap-mapping. 
For  instance,  consider  the  CANN1072  problem  with  32  processors.  For  a  grain  size 
of  four,  the  block  case  provides  a  28%  saving  in  communication  in  going  from  wrap 
mapping  to  the  block  scheme  while  the  load  imbalance  factor  goes  from  0.14  to  0.38, 
whereas  when  the  grain  size  is  25,  the  savings  in  communication  over  wrap-mapping 
is  63%  while  the  load  imbalance  factor  goes  from  0.14  to  1.26. 


5  Conclusions 


In  this  paper,  we  have  described  a  block  based,  automatic  partitioning  and  scheduling 
scheme  for  factoring  sparse  matrices  on  message  passing  systems.  The  primary  focus 
is  towards  automating  the  process  so  that  the  tedious  task  of  manual  parallelization 
is  kept  to  a  minimum.  The  partitioner  makes  use  of  data  locality  to  reduce  communi¬ 
cation  overhead  and  at  the  same  time  attempts  to  provide  the  necessary  flexibility  to 
the  scheduler  in  manipulating  the  work  allocation  so  that  the  load  remains  balanced. 
We  have  used  the  example  of  Cholesky  factorization  to  describe  the  methodology. 
However,  it  can  very  easily  be  adapted  to  other  factoring  methods  used  in  sparse 
matrix  computations.  In  fact,  it  can  be  generalized  to  computations  that  can  be 
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Appl. 

P 

Communication 

Work  Distr. 

Total 

Mean 

Mean 

A 

1 

0 

0 

11164 

KM 

BUS 

4 

2485 

621 

2791 

16 

3705 

231 

698 

W 

32 

3832 

120 

349 

1 

0 

■■ 

605840 

0 

CANN 

4 

52363 

151460 

■HIM 

16 

171764 

37865 

32 

239646 

7489 

18932 

1 

0 

0 

46804 

■■ 

DWT 

4 

7599 

1900 

11701 

16 

17867 

1117 

2925 

32 

20990 

656 

1462 

1 

0 

0 

434577 

0 

LAP 

4 

42663 

10665 

108644 

wnM 

16 

133720 

8357 

27161 

32 

177625 

5551 

13580 

1 

0 

501570 

KM 

LSHP 

4 

46347 

11586 

125392 

03 

16 

146322 

9145 

31348 

03 

32 

192977 

15674 

IBI 

Table  5:  Wrap  mapping. 
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represented  as  directed  acyclic  graphs  with  sufficient  information  prior  to  performing 
the  computations. 

To  analyze  the  effects  on  the  performance  of  the  partitioning  and  scheduling  tech¬ 
nique  used,  we  have  compared  the  communication  overhead  in  the  form  of  total  data 
traffic  with  that  obtained  from  an  implementation  where  a  straightforward  column 
wrap  scheme  is  used.  Five  representative  test  matrices  from  the  Harwell-Boeing  pack¬ 
age  were  used  for  this  purpose.  The  comparison  shows  that  the  block-based  scheme 
results  in  a  significant  reduction  in  the  communication  overhead  as  compared  to  the 
wrap-mapping  scheme.  This  is  in  agreement  with  our  motivation  for  blocking.  On  the 
other  hand,  the  block  method  results  in  more  load  imbalance.  Wrap-mappings  usu¬ 
ally  lead  to  processors  communicating  with  a  large  number  of  other  processors  leading 
to  a  large  amount  of  data  traffic  and  possibly  to  hot-spots.  However,  in  block-based 
schemes,  most  of  the  communication  among  blocks  occur  within  a  cluster  and  hence 
can  mostly  be  confined  to  small  groups  of  processors.  Although  the  increased  load 
imbalance  is  a  serious  issue,  the  provision  of  the  parameters  such  as  the  grain  size  and 
the  cluster  widths  allows  one  to  minimize  the  load  imbalance  for  particular  applica¬ 
tions.  Further  study  of  the  structure  of  the  sparse  matrices  is  required  to  optimize 
these  parameters  for  individual  applications.  Moreover,  in  real  applications  factoring 
is  only  a  part  of  the  overall  solution  of  the  system  and  other  computations  such  as 
triangular  solves  can  provide  additional  flexibility  in  the  balancing  the  load  which  is 
not  taken  into  account  here.  Finally,  more  sophisticated  scheduling  strategies  could 
be  used  to  improve  performance.  Thus,  for  systems  such  as  message  passing  archi¬ 
tectures,  where  communication  overhead  is  much  more  expensive  than  computation, 
automated,  block-based  methods  such  as  the  one  described  here  may  prove  to  be 
better  alternatives. 
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